{ggstatsplot}: Informative Statistical Visualizations

Indrajeet Patil

Why {ggstatsplot}?

Current CRAN package count >23,000




{ggstatsplot} provides

📊 information-rich plots with statistical details

📝 suitable for faster (exploratory) data analysis and reporting

Informative graphic = a thousand words

Graphical summaries can reveal problems not visible from numerical statistics.

Ready-made plot = no customization

The grammar of graphics is a powerful framework (Wilkinson, 2011) and can help you make any graphics fitting your specific data visualization needs! But…

Quality of Life (QoL) improvements with {ggstatsplot}

Provide ready-made plots with defaults following the best practices in statistical reporting and data visualization.

Simpler/faster data analysis workflow

In a typical exploratory data analysis workflow, data visualization and statistical modeling are two different phases: visualization informs modeling, and modeling can suggest a different visualization, and so on and so forth.

Central idea of {ggstatsplot}

Simple: combine these two phases into one!

And a LOT more!

…but we will come back to that later 📌

Let’s get started first!


Package available for installation on CRAN and GitHub:

Type Command
Release install.packages("ggstatsplot")
Development pak::pak("IndrajeetPatil/ggstatsplot")

Example function

ggbetweenstats()

For between-group comparisons

ggbetweenstats(
  data  = iris,
  x     = Species,
  y     = Sepal.Length,
  title = "Distribution of sepal length across Iris species"
)

Important

✏️ Defaults

  • raw data + distributions
  • descriptive statistics
  • inferential statistics
  • effect size + uncertainty
  • pairwise comparisons
  • Bayesian hypothesis-testing
  • Bayesian estimation

Statistical approaches available

  • parametric
  • non-parametric
  • robust
  • Bayesian

Other functions

Benefits for Statistical Reporting

Results in context of the data 🕵️

Standard approach

Pearson’s correlation test revealed that, across 142 participants, variable x was negatively correlated with variable y: \(t(140)=-0.76, p=.446\). The effect size \((r=-0.06, 95\% CI [-.23,.10])\) was small, as per Cohen’s (1988) conventions. The Bayes Factor for the same analysis revealed that the data were 5.81 times more probable under the null hypothesis as compared to the alternative hypothesis. This can be considered moderate evidence (Jeffreys, 1961) in favor of the null hypothesis (absence of any correlation between x and y).
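Assembling such a sentence by hand means pulling each number out of a test object and formatting it yourself. A minimal base R sketch, using `mtcars` as stand-in data (the dataset behind the numbers above is not given here, and the `report` string format is only illustrative):

```r
# Stand-in data: the dataset behind the slide's numbers is not given,
# so `mtcars` is used purely for illustration.
res <- cor.test(mtcars$wt, mtcars$mpg, method = "pearson")

# Every reported quantity has to be extracted and formatted manually.
report <- sprintf(
  "t(%d) = %.2f, p %s, r = %.2f, 95%% CI [%.2f, %.2f]",
  as.integer(res$parameter),             # degrees of freedom
  res$statistic,                         # t statistic
  format.pval(res$p.value, eps = 0.001), # p-value, floored at "<0.001"
  res$estimate,                          # Pearson's r
  res$conf.int[1],                       # lower CI bound
  res$conf.int[2]                        # upper CI bound
)
print(report)
```

The Bayes factor and the effect-size interpretation would need further packages on top of this, and every number copied into the text is a chance for a transcription error.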

{ggstatsplot} approach

Toggling statistical approaches 🔀

Parametric

# anova
ggbetweenstats(
  data = mtcars,
  x = cyl,
  y = wt,
  type = "p" 
)

# correlation analysis
ggscatterstats(
  data = mtcars,
  x = wt,
  y = mpg,
  type = "p" 
)

# t-test
gghistostats(
  data = mtcars,
  x = wt,
  test.value = 2,
  type = "p" 
)

Non-parametric

# Kruskal-Wallis test (non-parametric counterpart of the ANOVA)
ggbetweenstats(
  data = mtcars,
  x = cyl,
  y = wt,
  type = "np" 
)

# correlation analysis (Spearman's rank correlation)
ggscatterstats(
  data = mtcars,
  x = wt,
  y = mpg,
  type = "np" 
)

# one-sample Wilcoxon test (non-parametric counterpart of the t-test)
gghistostats(
  data = mtcars,
  x = wt,
  test.value = 2,
  type = "np" 
)
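The other two approaches listed earlier, robust and Bayesian, follow the same pattern. A short sketch, assuming the `type` abbreviations `"r"` and `"bayes"` (the latter appears later in these slides); the object names `p_robust` and `p_bayes` are illustrative:

```r
library(ggstatsplot)

# robust: same call as above, only `type` changes
p_robust <- ggbetweenstats(data = mtcars, x = cyl, y = wt, type = "r")

# Bayesian
p_bayes <- ggbetweenstats(data = mtcars, x = cyl, y = wt, type = "bayes")

p_robust
p_bayes
```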

Alternative: Pure Pain

Hunting for packages

📦 for inferential statistics ({stats})
📦 computing effect size + CIs ({effectsize})
📦 for descriptive statistics ({skimr})
📦 pairwise comparisons ({multcomp})
📦 Bayesian hypothesis testing ({BayesFactor})
📦 Bayesian estimation ({bayestestR})
📦 …

Inconsistent APIs

🤔 accepts data frame, vector, matrix?
🤔 long/wide format data?
🤔 works with NAs?
🤔 returns data frame, vector, matrix?
🤔 works with tibbles?
🤔 has all necessary details?
🤔 …
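A small illustration of the inconsistency, using only base R: three routines that could all feed one between-group comparison each accept and return something different. This is a sketch; the choice of functions is an assumption made for illustration, not taken from the slides.

```r
# One-way ANOVA: formula + data frame in, model object out
fit <- aov(wt ~ factor(cyl), data = mtcars)

# Pairwise comparisons: model object in, list of matrices out
tukey <- TukeyHSD(fit)

# Non-parametric alternative: formula + data frame in, "htest" list out
kw <- kruskal.test(wt ~ factor(cyl), data = mtcars)

# Three different return types to reconcile before anything can be reported
str(list(
  anova    = class(fit),
  pairwise = class(tukey[[1]]),
  kruskal  = class(kw)
))
```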

Customizability

“What if I don’t like the default plots?” 🤔

Changing aesthetics 🎨

ggbetweenstats(
  data = movies_long,
  x = mpaa,
  y = rating,
  ggtheme = ggthemes::theme_economist(), 
  palette = "Darjeeling2", 
  package = "wesanderson" 
)

Aesthetic preferences are not an excuse to avoid {ggstatsplot}! 😻 Any ggplot theme or palette can be used.

N.B. The default palette is colorblind-friendly.

Modification with {ggplot2} 🛠

You can modify {ggstatsplot} plots further using {ggplot2} functions. 🎉

ggbetweenstats(
  data = mtcars,
  x = am,
  y = wt,
  type = "bayes"
) +
  scale_y_continuous(sec.axis = dup_axis()) 

Too much information 🙈

Get only plots:

ggbetweenstats(
  data = iris,
  x = Species,
  y = Sepal.Length,
  # turn off statistical analysis
  centrality.plotting = FALSE, 
  results.subtitle = FALSE, 
  bf.message = FALSE, 
  # turn off pairwise comparisons
  pairwise.display = "none" 
)

Get only expressions:

stats_expr <- ggpiestats(
  Titanic_full, Survived, Sex
) %>% extract_subtitle()

ggiraphExtra::ggSpine( 
  data = Titanic_full,
  aes(x = Sex, fill = Survived)
) +
  labs(subtitle = stats_expr)  

Critical Evaluation

Things to be wary of

“Golem of Prague” issue

Promotes mindless application of statistical tests.

Easy-to-use software can lead to misuse.

Clunky API

  • Too many arguments to remember.
  • Not a “real” {ggplot2} extension.
  • Limited number of functions.
  • Statistical proficiency needed.

Attractive Qualities

Things that will pull you in

Quality Assurance

Each commit must pass many QA checks:

CI Checks (GitHub Actions)

  • Unit tests (random-order)
  • Code coverage (100%)
  • Linting (0 lints)
  • Formatting (0 issues)
  • Documentation (website, link rot, examples)
  • CRAN checks (0 E, 0 W, 0 N)
  • Pre-commit hooks (0 issues)
  • Portability (Linux, macOS, Windows)
  • Robustness (dependencies, R versions)

Healthy and active code base

User Love

Total downloads > 500K (97th percentile)

library(packageRank)
plot(
  cranDownloads("ggstatsplot", from = "2018-04-03", to = Sys.Date()),
  graphics = "ggplot2", smooth = TRUE
)

Total citations > 1000

Conclusion

Benefits of the {ggstatsplot} approach

{ggstatsplot}, a package that combines data visualization and statistical analysis in a single step, is a powerful tool that:

  • provides ready-made plots with defaults that are information-rich
  • minimizes the chances of making errors in statistical reporting
  • follows best practices in data visualization and statistical reporting
  • highlights the importance of the effect by providing effect size measures by default
  • provides an easy way to evaluate the absence of an effect using a Bayesian framework
  • helps evaluate statistical analysis in the context of the underlying data
  • is easy and simple enough that somebody with little coding experience can use it without making an error

For more

Source code for these slides can be found on GitHub.

If you are interested in good programming and software development practices, check out my other slide decks.

Find me at…

Twitter

LinkedIn

GitHub

Website

E-mail

Thank You 😊

Session information

sessioninfo::session_info(include_base = TRUE)
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.2 (2024-10-31)
 os       Ubuntu 22.04.5 LTS
 system   x86_64, linux-gnu
 hostname fv-az1052-290
 ui       X11
 language (EN)
 collate  C.UTF-8
 ctype    C.UTF-8
 tz       UTC
 date     2024-11-10
 pandoc   3.5 @ /opt/hostedtoolcache/pandoc/3.5/x64/ (via rmarkdown)
 quarto   1.6.33 @ /usr/local/bin/quarto

─ Packages ───────────────────────────────────────────────────────────────────
 package          * version     date (UTC) lib source
 base             * 4.4.2       2024-10-31 [3] local
 BayesFactor        0.9.12-4.7  2024-01-24 [1] RSPM
 bayestestR         0.15.0      2024-10-17 [1] RSPM
 bitops             1.0-9       2024-10-03 [1] RSPM
 BWStest            0.2.3       2023-10-10 [1] RSPM
 cachem             1.1.0       2024-05-16 [1] RSPM
 cli                3.6.3       2024-06-21 [1] RSPM
 coda               0.19-4.1    2024-01-31 [1] RSPM
 colorspace         2.1-1       2024-07-26 [1] RSPM
 compiler           4.4.2       2024-10-31 [3] local
 correlation        0.8.6       2024-10-26 [1] RSPM
 cranlogs           2.1.1       2019-04-29 [1] RSPM
 curl               6.0.0       2024-11-05 [1] RSPM
 data.table         1.16.2      2024-10-10 [1] RSPM
 datasets         * 4.4.2       2024-10-31 [3] local
 datawizard         0.13.0      2024-10-05 [1] RSPM
 digest             0.6.37      2024-08-19 [1] RSPM
 dplyr              1.1.4       2023-11-17 [1] RSPM
 effectsize         0.8.9       2024-07-03 [1] RSPM
 evaluate           1.0.1       2024-10-10 [1] RSPM
 fansi              1.0.6       2023-12-08 [1] RSPM
 farver             2.1.2       2024-05-13 [1] RSPM
 fastmap            1.2.0       2024-05-15 [1] RSPM
 generics           0.1.3       2022-07-05 [1] RSPM
 ggiraph            0.8.10      2024-05-17 [1] RSPM
 ggiraphExtra       0.3.0       2020-10-06 [1] RSPM
 ggplot2          * 3.5.1       2024-04-23 [1] RSPM
 ggrepel            0.9.6       2024-09-07 [1] RSPM
 ggsignif           0.6.4       2022-10-13 [1] RSPM
 ggstatsplot      * 0.12.5.9000 2024-11-10 [1] Github (IndrajeetPatil/ggstatsplot@b7350e9)
 ggthemes           5.1.0       2024-02-10 [1] RSPM
 glue               1.8.0       2024-09-30 [1] RSPM
 gmp                0.7-5       2024-08-23 [1] RSPM
 graphics         * 4.4.2       2024-10-31 [3] local
 grDevices        * 4.4.2       2024-10-31 [3] local
 grid               4.4.2       2024-10-31 [3] local
 gtable             0.3.6       2024-10-25 [1] RSPM
 htmltools          0.5.8.1     2024-04-04 [1] RSPM
 htmlwidgets        1.6.4       2023-12-06 [1] RSPM
 httr               1.4.7       2023-08-15 [1] RSPM
 insight            0.20.5      2024-10-02 [1] RSPM
 jsonlite           1.8.9       2024-09-20 [1] RSPM
 knitr              1.49        2024-11-08 [1] RSPM
 kSamples           1.2-10      2023-10-07 [1] RSPM
 labeling           0.4.3       2023-08-29 [1] RSPM
 lattice            0.22-6      2024-03-20 [3] CRAN (R 4.4.2)
 lifecycle          1.0.4       2023-11-07 [1] RSPM
 lubridate          1.9.3       2023-09-27 [1] RSPM
 magrittr           2.0.3       2022-03-30 [1] RSPM
 MASS               7.3-61      2024-06-13 [3] CRAN (R 4.4.2)
 Matrix             1.7-1       2024-10-18 [3] CRAN (R 4.4.2)
 MatrixModels       0.5-3       2023-11-06 [1] RSPM
 memoise            2.0.1       2021-11-26 [1] RSPM
 methods          * 4.4.2       2024-10-31 [3] local
 mgcv               1.9-1       2023-12-21 [3] CRAN (R 4.4.2)
 multcompView       0.1-10      2024-03-08 [1] RSPM
 munsell            0.5.1       2024-04-01 [1] RSPM
 mvtnorm            1.3-2       2024-11-04 [1] RSPM
 mycor              0.1.1       2018-04-10 [1] RSPM
 nlme               3.1-166     2024-08-14 [3] CRAN (R 4.4.2)
 packageRank      * 0.9.3       2024-10-16 [1] RSPM
 paletteer          1.6.0       2024-01-21 [1] RSPM
 parallel           4.4.2       2024-10-31 [3] local
 parameters         0.23.0      2024-10-18 [1] RSPM
 patchwork          1.3.0       2024-09-16 [1] RSPM
 pbapply            1.7-2       2023-06-27 [1] RSPM
 performance        0.12.4      2024-10-18 [1] RSPM
 pillar             1.9.0       2023-03-22 [1] RSPM
 pkgconfig          2.0.3       2019-09-22 [1] RSPM
 pkgsearch          3.1.3       2023-12-10 [1] RSPM
 plyr               1.8.9       2023-10-02 [1] RSPM
 PMCMRplus          1.9.12      2024-09-08 [1] RSPM
 ppcor              1.1         2015-12-03 [1] RSPM
 prismatic          1.1.2       2024-04-10 [1] RSPM
 purrr              1.0.2       2023-08-10 [1] RSPM
 R.methodsS3        1.8.2       2022-06-13 [1] RSPM
 R.oo               1.27.0      2024-11-01 [1] RSPM
 R.utils            2.12.3      2023-11-18 [1] RSPM
 R6                 2.5.1       2021-08-19 [1] RSPM
 RColorBrewer       1.1-3       2022-04-03 [1] RSPM
 Rcpp               1.0.13-1    2024-11-02 [1] RSPM
 RCurl              1.98-1.16   2024-07-11 [1] RSPM
 rematch2           2.1.2       2020-05-01 [1] RSPM
 reshape2           1.4.4       2020-04-09 [1] RSPM
 rlang              1.1.4       2024-06-04 [1] RSPM
 rmarkdown          2.29        2024-11-04 [1] RSPM
 Rmpfr              0.9-5       2024-01-21 [1] RSPM
 scales             1.3.0       2023-11-28 [1] RSPM
 sessioninfo        1.2.2.9000  2024-11-10 [1] Github (r-lib/sessioninfo@37c81af)
 sjlabelled         1.2.0       2022-04-10 [1] RSPM
 sjmisc             2.8.10      2024-05-13 [1] RSPM
 splines            4.4.2       2024-10-31 [3] local
 stats            * 4.4.2       2024-10-31 [3] local
 statsExpressions   1.6.1       2024-10-31 [1] RSPM
 stringi            1.8.4       2024-05-06 [1] RSPM
 stringr            1.5.1       2023-11-14 [1] RSPM
 sugrrants          0.2.9       2024-03-12 [1] RSPM
 SuppDists          1.1-9.8     2024-09-03 [1] RSPM
 systemfonts        1.1.0       2024-05-15 [1] RSPM
 tibble             3.2.1       2023-03-20 [1] RSPM
 tidyr              1.3.1       2024-01-24 [1] RSPM
 tidyselect         1.2.1       2024-03-11 [1] RSPM
 timechange         0.3.0       2024-01-18 [1] RSPM
 tools              4.4.2       2024-10-31 [3] local
 utf8               1.2.4       2023-10-22 [1] RSPM
 utils            * 4.4.2       2024-10-31 [3] local
 uuid               1.2-1       2024-07-29 [1] RSPM
 vctrs              0.6.5       2023-12-01 [1] RSPM
 withr              3.0.2       2024-10-28 [1] RSPM
 xfun               0.49        2024-10-31 [1] RSPM
 yaml               2.3.10      2024-07-26 [1] RSPM
 zeallot            0.1.0       2018-01-28 [1] RSPM

 [1] /home/runner/work/_temp/Library
 [2] /opt/R/4.4.2/lib/R/site-library
 [3] /opt/R/4.4.2/lib/R/library
 * ── Packages attached to the search path.

──────────────────────────────────────────────────────────────────────────────

Appendix

Examples of other functions

ggwithinstats()

Hypothesis about group differences: repeated measures design

ggwithinstats(
  data = WRS2::WineTasting,
  x = Wine,
  y = Taste
)

Important

✏️ Defaults

  • raw data + distributions
  • descriptive statistics
  • inferential statistics
  • effect size + uncertainty
  • pairwise comparisons
  • Bayesian hypothesis-testing
  • Bayesian estimation

Statistical approaches available

  • parametric
  • non-parametric
  • robust
  • Bayesian

gghistostats()

Distribution of a numeric variable

gghistostats(
  data = movies_long,
  x = budget,
  test.value = 30 
)

Important

✏️ Defaults

  • counts + proportion for bins
  • descriptive statistics
  • inferential statistics
  • effect size + uncertainty
  • pairwise comparisons
  • Bayesian hypothesis-testing
  • Bayesian estimation

Statistical approaches available

  • parametric
  • non-parametric
  • robust
  • Bayesian

ggdotplotstats()

Labeled numeric variable

ggdotplotstats(
  data = movies_long,
  x = budget,
  y = genre,
  test.value = 30 
)

Important

✏️ Defaults

  • descriptive statistics
  • inferential statistics
  • effect size + uncertainty
  • pairwise comparisons
  • Bayesian hypothesis-testing
  • Bayesian estimation

Statistical approaches available

  • parametric
  • non-parametric
  • robust
  • Bayesian

ggscatterstats()

Hypothesis about correlation: Two numeric variables

ggscatterstats(
  data = movies_long,
  x = budget,
  y = rating
)

Important

✏️ Defaults

  • joint distribution
  • marginal distribution
  • effect size + uncertainty
  • pairwise comparisons
  • Bayesian hypothesis-testing
  • Bayesian estimation

Statistical approaches available

  • parametric
  • non-parametric
  • robust
  • Bayesian

ggcorrmat()

Hypothesis about correlation: Multiple numeric variables

ggcorrmat(dplyr::starwars)

Important

✏️ Defaults

  • inferential statistics
  • effect size + uncertainty
  • careful handling of NAs
  • partial correlations

Statistical approaches available

  • parametric
  • non-parametric
  • robust
  • Bayesian

ggpiestats()

Hypothesis about composition of categorical variables

ggpiestats(
  data = mtcars,
  x = am,
  y = cyl
)

Important

✏️ Defaults

  • descriptive statistics
  • inferential statistics
  • effect size + uncertainty
  • goodness-of-fit tests
  • Bayesian hypothesis-testing
  • Bayesian estimation

ggbarstats()

Hypothesis about composition of categorical variables

ggbarstats(
  data = mtcars,
  x = am,
  y = cyl
)

Important

✏️ Defaults

  • descriptive statistics
  • inferential statistics
  • effect size + uncertainty
  • goodness-of-fit tests
  • Bayesian hypothesis-testing
  • Bayesian estimation

ggcoefstats()

Hypothesis about regression coefficients

mod <- lm(
  formula = rating ~ mpaa,
  data = movies_long
)

ggcoefstats(mod)

Important

✏️ Defaults

  • estimate + uncertainty
  • inferential statistics (\(t\), \(z\), \(F\), \(\chi^2\))
  • model fit indices (AIC + BIC)

Supports all regression models supported in the {easystats} ecosystem.

Meta-analysis is also supported!
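As a sketch of the same function applied to a different model class, here is a logistic regression fitted with `glm()`. The model and the object names `mod_glm` and `p_glm` are illustrative, not from the slides; for such a model the reported statistic should be \(z\) rather than \(t\).

```r
library(ggstatsplot)

# illustrative model: transmission type as a function of car weight
mod_glm <- glm(am ~ wt, data = mtcars, family = binomial())

# same call as for the linear model above
p_glm <- ggcoefstats(mod_glm)
p_glm
```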

grouped_ variants

Iterating over a grouping variable

grouped_ functions

grouped_ggpiestats(
  data = mtcars,
  x = cyl,
  grouping.var = am 
)

Available grouped_ variants:

  • grouped_ggbetweenstats()
  • grouped_ggwithinstats()
  • grouped_gghistostats()
  • grouped_ggdotplotstats()
  • grouped_ggscatterstats()
  • grouped_ggcorrmat()
  • grouped_ggpiestats()
  • grouped_ggbarstats()

More {ggstatsplot} benefits

Supports different statistical approaches

Note

Functions Description Parametric Non-parametric Robust Bayesian
ggbetweenstats() Between group comparisons
ggwithinstats() Within group comparisons
gghistostats(), ggdotplotstats() Distribution of a numeric variable
ggcorrmat() Correlation matrix
ggscatterstats() Correlation between two variables
ggpiestats(), ggbarstats() Association between categorical variables NA NA
ggpiestats(), ggbarstats() Equal proportions for categorical variable levels NA NA
ggcoefstats() Regression modeling
ggcoefstats() Random-effects meta-analysis NA

Best practices in statistical reporting 🏆

Avoiding reporting errors

“half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion”

(Nuijten et al., Behavior Research Methods, 2016)

Since the plot and the statistical analysis are yoked together, the chances of making an error in reporting the results are minimized.

No need to worry about updating figures and statistical details separately. 🔗

Making sense of null results

\(p > 0.05\): The null hypothesis (H0) can’t be rejected

But can it be accepted?! Null Hypothesis Significance Testing (NHST) stays silent on that. 🤫

“In 72% of cases, nonsignificant results were misinterpreted, in that the authors inferred that the effect was absent. A Bayesian reanalysis revealed that fewer than 5% of the nonsignificant findings provided strong evidence (i.e., \(BF_{01} > 10\)) in favor of the null hypothesis over the alternative hypothesis.”

(Aczel et al., AMPPS, 2018)

Juxtaposing frequentist and Bayesian statistics for the same analysis helps to properly interpret the null results.
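A sketch of how both sets of results can be pulled from a single plot, assuming `extract_caption()` is available alongside the `extract_subtitle()` helper used earlier, and that the Bayesian details are placed in the caption when `bf.message = TRUE`:

```r
library(ggstatsplot)

p <- ggbetweenstats(data = mtcars, x = am, y = wt, bf.message = TRUE)

# frequentist test details (shown as the plot subtitle)
freq_expr <- extract_subtitle(p)

# Bayes factor and posterior estimate (shown as the plot caption)
bayes_expr <- extract_caption(p)

freq_expr
bayes_expr
```

Both expressions come from the same call on the same data, so the two analyses cannot drift apart.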

A few other benefits

Minimal code needed (data, x, y): minimizes chances of error + tidy scripts. 💅

Disembodied figures stand on their own and are easy to evaluate. 🧐

More breathing room for theoretical discussion and other text. ✍

Misconceptions: This package is…


❌ an alternative to learning ggplot2
✅ the better you know ggplot2, the more you can modify the defaults to your liking

❌ meant to be used in talks/presentations
✅ the defaults are too complicated for communicating results effectively in time-constrained settings, e.g. conference talks

❌ only relevant when used in publications
✅ not necessarily; it can also be useful purely during the exploratory phase

❌ the only game in town
✅ there is excellent open-source GUI software: JASP and jamovi